High-dimensional data, meaning data that requires more than two or three dimensions to represent, can be difficult to interpret. One approach to simplification is to assume that the data of interest lie on an embedded non-linear manifold within the higher-dimensional space. If the manifold is of low enough dimension, the data can be visualised in the low-dimensional space.
Map $x\rightarrow y$ preserving distances as much as possible.
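This distance-preserving objective is essentially what metric multidimensional scaling (MDS) optimises. A minimal sketch (the random 10-dimensional point cloud is an illustrative choice, not data from this notebook):

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))  # 100 points in a 10-D space

# MDS searches for 2-D coordinates whose pairwise distances match
# the original pairwise distances as closely as possible
Y = MDS(n_components=2, random_state=0).fit_transform(X)
print(Y.shape)
```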

print("Typical datasets for dimensionality reduction evaluation")
typicalDatasets.show()
Typical datasets for dimensionality reduction evaluation
Comment: real datasets have many more dimensions, more complex structure, noise, outliers, etc.
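Toy manifolds of this kind can be generated directly with scikit-learn; a sketch assuming the usual swiss-roll and S-curve examples:

```python
from sklearn.datasets import make_swiss_roll, make_s_curve

# 3-D point clouds lying on intrinsically 2-D manifolds;
# the returned t values parameterise position along each manifold
X_roll, t_roll = make_swiss_roll(n_samples=1000, random_state=0)
X_curve, t_curve = make_s_curve(n_samples=1000, random_state=0)
print(X_roll.shape, X_curve.shape)
```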
Issue: small $||x_i - x_j||$ should not always imply small $||y_i - y_j||$
Isomap: map $x \rightarrow y$ preserving the correspondence between distances in the target space and geodesic distances along the surface in the original space.

Algorithm: there are four stages to Isomap:
1. Find the neighbours of each point (e.g. its $k$ nearest points).
2. Build the neighbourhood graph, weighting each edge by the Euclidean distance between its endpoints.
3. Compute the shortest-path (geodesic) distance between every pair of points in the graph.
4. Apply classical MDS to the geodesic distance matrix to obtain the low-dimensional embedding.
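The stages above can be sketched from scratch, using scipy's shortest-path routine and a hand-rolled classical MDS (the function name is illustrative, not part of any library):

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap_sketch(X, n_components=2, n_neighbors=5):
    # Stages 1-2: k-nearest-neighbour graph with Euclidean edge weights
    graph = kneighbors_graph(X, n_neighbors, mode='distance')
    # Stage 3: geodesic distances = shortest paths through the graph
    D = shortest_path(graph, method='D', directed=False)
    # Stage 4: classical MDS on the geodesic distance matrix
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1][:n_components]
    # coordinates = top eigenvectors scaled by sqrt of their eigenvalues
    return eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0))
```

This assumes the neighbourhood graph is connected; if it is not, some geodesic distances come out infinite and a larger `n_neighbors` is needed.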
from sklearn.datasets import fetch_openml  # fetch_mldata was removed from scikit-learn
import numpy as np
import matplotlib.pyplot as plt

# 'MNIST original' from mldata.org is no longer available;
# 'mnist_784' on OpenML contains the same 70000 28x28 digit images
mnist = fetch_openml('mnist_784', version=1, as_frame=False)
mnist.target = mnist.target.astype(int)
numberImages = []
for i in range(48):
    numberImages.append((mnist.data[1250 * i].reshape(28, 28), ''))
images = RowImages(numberImages, columns=8)
images.show(cmap='gray_r')
from sklearn.manifold import Isomap
# use only 1/10 of the data: full dataset takes a long time!
data = mnist.data[::10]
target = mnist.target[::10]
model = Isomap(n_components=2)
proj = model.fit_transform(data)
plt.figure(figsize=(16, 8))
plt.scatter(proj[:, 0], proj[:, 1], c=target, cmap=plt.cm.get_cmap('jet', 10))
plt.colorbar(ticks=range(10))
plt.clim(-0.5, 9.5);
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(16, 6))
for i in range(2):
    plots[i].plotWithImages(ax[i], cmap='gray_r', thumb_frac=0.08)
The result gives you an idea of the variety of forms that the number "1" can take within the dataset. The data lies along a broad curve in the projected space, which appears to trace the orientation of the digit. As you move up the plot, you find ones that have hats and/or bases, though these are very sparse within the dataset. The projection lets us identify outliers that have data issues: for example, pieces of the neighboring digits that snuck into the extracted images.
fig, ax = plt.subplots(1, 2, sharey=True, figsize=(16, 6))
for i in range(2):
    plots[i + 2].plotWithImages(ax[i], cmap='gray_r', thumb_frac=0.1)
otherFig, otherAx = plt.subplots(3, 2, sharey=True, figsize=(16, 10))
for i in range(3):
    for j in range(2):
        otherNumberPlots[i * 2 + j].plotWithImages(otherAx[i][j], cmap='gray_r', thumb_frac=0.08)
facesImages = []
for i in range(60):
    facesImages.append((faces.images[i], ''))
images = RowImages(facesImages, columns = 12)
images.show(cmap='gray')
figFaces, axFaces = plt.subplots(figsize=(16, 8))
facesPlot.plotWithImages(axFaces, images=faces.images[:, ::2, ::2])
The result is interesting: the first two Isomap dimensions seem to describe global image features: the overall darkness or lightness of the image from left to right, and the general orientation of the face from bottom to top. This gives us a nice visual indication of some of the fundamental features in our data.
putinModel = Isomap(n_components=2)
putinPlot = PlotComponents(putinData, putinModel)
figPutin, axPutin = plt.subplots(figsize=(16, 8))
putinImages = faces.images[faces.target==putinIndex]
putinPlot.plotWithImages(axPutin, images=putinImages, thumb_frac=0.0, reshape=())
from sklearn.datasets import fetch_olivetti_faces
olivettiDataset = fetch_olivetti_faces()
olivettiImages = []
for i in range(60):
    olivettiImages.append((olivettiDataset.images[i], ''))
images = RowImages(olivettiImages, columns = 12)
images.show(cmap='gray')
print(olivettiDataset.data.shape)
(400, 4096)
figOlivetti, axOlivetti = plt.subplots(figsize=(16, 8))
olivettiPlot.plotWithImages(axOlivetti, images=olivettiDataset.images)
Return to the digit images. We have 70000 data points, each a 28x28 image, i.e. a 784-dimensional vector.
X = mnist.data[::30]
y = mnist.target[::30]
print(X.shape)
print(y.shape)
(2334, 784)
(2334,)
To evaluate the algorithm, split the data into training and test parts:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed; use model_selection
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
print("X_train shape: %s" % repr(X_train.shape))
print("y_train shape: %s" % repr(y_train.shape))
print("X_test shape: %s" % repr(X_test.shape))
print("y_test shape: %s" % repr(y_test.shape))
X_train shape: (1750, 784)
y_train shape: (1750,)
X_test shape: (584, 784)
y_test shape: (584,)
isomapModel = Isomap(n_components=2, n_neighbors=5)
fittedIsomapModel = isomapModel.fit(X_train)
projection = fittedIsomapModel.transform(X_train)
plt.figure(figsize=(16, 8))
plt.scatter(projection[:, 0], projection[:, 1], c=y_train, cmap=plt.cm.get_cmap('jet', 10));
testProjection = fittedIsomapModel.transform(X_test)
plt.figure(figsize=(16, 8))
plt.scatter(testProjection[:, 0], testProjection[:, 1], c=y_test, cmap=plt.cm.get_cmap('jet', 10));
import math

def distance(p1, p2):
    # Euclidean distance between two 2-D points
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2)
success = 0
k = 9
for i in range(len(testProjection)):
    testPoint = testProjection[i]
    testPointLabel = y_test[i]
    # distance from this test point to every projected training point
    testMinsForPoint = list(enumerate(map(lambda p: distance(p, testPoint), projection)))
    kNeighbours = sorted(testMinsForPoint, key=lambda t: t[1])
    distances = [n[1] for n in kNeighbours[:k]]
    indices = [n[0] for n in kNeighbours[:k]]
    labels = [y_train[idx] for idx in indices]
    # distance-weighted vote: each neighbour contributes 1/distance to its label
    m = {}
    for j, label in enumerate(labels):
        m[label] = m.get(label, 0.0) + 1.0 / distances[j]
    if max(m, key=m.get) == testPointLabel:
        success += 1
print("Success rate", float(success) / len(testProjection))
Success rate 0.559931506849315
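The loop above is a hand-rolled distance-weighted $k$-NN classifier; scikit-learn's `KNeighborsClassifier` with `weights='distance'` implements the same $1/d$ voting. A self-contained sketch on synthetic blobs (illustrative data, not the Isomap projection above):

```python
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

Xb, yb = make_blobs(n_samples=400, centers=3, random_state=0)
Xb_tr, Xb_te, yb_tr, yb_te = train_test_split(Xb, yb, random_state=0)

# weights='distance' reproduces the 1/d vote accumulation from the loop above
knn = KNeighborsClassifier(n_neighbors=9, weights='distance')
knn.fit(Xb_tr, yb_tr)
acc = knn.score(Xb_te, yb_te)
print("Success rate:", acc)
```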
isomapModel3 = Isomap(n_components=3, n_neighbors=5)
fittedIsomapModel3 = isomapModel3.fit(X_train)
projection3 = fittedIsomapModel3.transform(X_train)
testProjection3 = fittedIsomapModel3.transform(X_test)
from mpl_toolkits.mplot3d import Axes3D
fig = plt.figure(figsize=(16, 8))
ax = fig.add_subplot(111, projection='3d')
_ = ax.scatter(projection3[:, 0], projection3[:, 1], projection3[:, 2], c=y_train, cmap=plt.cm.get_cmap('jet', 10))
fig = plt.figure(figsize=(16, 8))
ax = fig.add_subplot(111, projection='3d')
_ = ax.scatter(testProjection3[:, 0], testProjection3[:, 1], testProjection3[:, 2], c=y_test, cmap=plt.cm.get_cmap('jet', 10))
def distance3(p1, p2):
    # Euclidean distance between two 3-D points
    return math.sqrt((p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2 + (p1[2] - p2[2]) ** 2)

success3 = 0
k = 9
for i in range(len(testProjection3)):
    testPoint = testProjection3[i]
    testPointLabel = y_test[i]
    testMinsForPoint = list(enumerate(map(lambda p: distance3(p, testPoint), projection3)))
    kNeighbours = sorted(testMinsForPoint, key=lambda t: t[1])
    indices = [n[0] for n in kNeighbours[:k]]
    distances = [n[1] for n in kNeighbours[:k]]
    labels = [y_train[idx] for idx in indices]
    # distance-weighted vote, as in the 2-D case
    m = {}
    for j, label in enumerate(labels):
        m[label] = m.get(label, 0.0) + 1.0 / distances[j]
    if max(m, key=m.get) == testPointLabel:
        success3 += 1
print("Success rate for n_components=3: ", float(success3) / len(testProjection3))
Success rate for n_components=3: 0.6643835616438356
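The improvement from two to three components suggests checking higher dimensions as well. A quick, self-contained sweep on scikit-learn's small 8x8 digits dataset (a stand-in for MNIST, chosen here only for speed):

```python
from sklearn.datasets import load_digits
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

scores = {}
for d in (2, 3, 5, 10):
    # fit Isomap on the training part, then project both parts into d dimensions
    iso = Isomap(n_components=d, n_neighbors=5).fit(X_tr)
    knn = KNeighborsClassifier(n_neighbors=9, weights='distance')
    knn.fit(iso.transform(X_tr), y_tr)
    scores[d] = knn.score(iso.transform(X_te), y_te)
    print(d, scores[d])
```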